Abstract

The prominent Bernstein-von Mises (BvM) Theorem claims that the posterior distribution is asymptotically normal, that its mean is nearly the maximum likelihood estimator (MLE), and that its variance is nearly the inverse of the total Fisher information matrix, as for the MLE. This result is usually used to justify elliptic credible sets built by Bayes simulations. This paper revisits the classical result from different viewpoints. Particular issues to address are: a nonasymptotic framework with just one finite sample, possible model misspecification, and a large parameter dimension. It appears that the BvM result can be extended to any smooth parametric family provided that the dimension $p$ of the parameter space satisfies the condition "$p^3/n$ is small".

1 Introduction



The prominent Bernstein-von Mises (BvM) Theorem claims that the posterior measure is asymptotically normal with the mean close to the maximum likelihood estimator (MLE) and the variance close to the variance of the MLE. This explains why this result is often considered as the Bayes counterpart of the frequentist Fisher Theorem about asymptotic normality of the MLE. The BvM result provides a theoretical background for different Bayesian procedures. In particular, one can use Bayesian computations for evaluation of the MLE and its variance. Also one can build elliptic credible sets using the first two moments of the posterior. The classical version of the BvM Theorem is stated for the standard parametric setup with a fixed parametric model and large samples. Modern statistical problems require considering situations with very complicated models involving a lot of parameters and with limited sample size. A number of papers in this direction have recently appeared. We mention Ghosal et al. (2000); Ghosal and van der Vaart (2007) for the i.i.d. models, Ghosal (1999), Ghosal (2000) for high dimensional linear models, Boucheron and Gassiat (2009), Kim (2006) for non-Gaussian models, Shen (2002), Castillo (2012) for the semiparametric version of the BvM result, Kleijn and van der Vaart (2006), Bunke and Milhaud (1998), Kleijn and van der Vaart (2012) for the misspecified parametric case, among many others. All the mentioned results require some special parametric structure, mainly i.i.d. structure or model linearity w.r.t. the parameter, as well as large samples.

This paper attempts to develop a unified framework for the BvM result under quite general conditions allowing for finite samples, model misspecification, and large parameter dimension. A finite sample nonasymptotic setup is very attractive because one can deal with just one sample of any structure. The main problem is that the majority of methods and tools from probability theory involve some asymptotic considerations. Any finite sample approach requires changing most of the mathematical devices. There are only a few general results in statistical inference which are stated for finite samples; see Boucheron and Massart (2011) and references therein in the context of i.i.d. modeling. This paper adopts the novel bracketing approach from Spokoiny (2012) and illustrates the power of this general methodology. In particular, we present a version of the BvM result which claims near normality of the posterior distribution under some mild regularity conditions on the considered parametric family. In the i.i.d. case the result applies even if the dimension $p$ of the parameter space grows with the sample size $n$. A polynomial growth $p \asymp n^{\gamma}$ is admissible as long as $\gamma < 1/3$.

Another important issue is a possible model misspecification. The classical parametric theory is essentially parametric in the sense that it requires the parametric assumption to be exactly fulfilled. Any violation of the parametric specification may destroy the BvM result. This study admits from the very beginning that the parametric specification is probably wrong. This automatically extends the applicability of the proposed approach. In particular, it allows for a unified treatment of parametric and nonparametric models.

First we specify our setup following Spokoiny (2012). Let $Y$ denote the observed data and $\mathbb{P}$ their distribution. A general parametric assumption (PA) means that $\mathbb{P}$ belongs to a $p$-dimensional family $(\mathbb{P}_\theta, \theta \in \Theta \subseteq \mathbb{R}^p)$ dominated by a measure $\mu_0$. This family yields the log-likelihood function $L(\theta) = L(Y,\theta) := \log \frac{d\mathbb{P}_\theta}{d\mu_0}(Y)$. The PA can be misspecified, so, in general, $L(\theta)$ is a quasi log-likelihood. The classical likelihood principle suggests estimating $\theta$ by maximizing the function $L(\theta)$:

$$\tilde\theta := \operatorname*{argmax}_{\theta \in \Theta} L(\theta). \tag{1.1}$$

If $\mathbb{P} \notin (\mathbb{P}_\theta)$, then the (quasi) MLE $\tilde\theta$ from (1.1) is still meaningful and it appears to be an estimate of the value $\theta^*$ defined by maximizing the expected value of $L(\theta)$:

$$\theta^* := \operatorname*{argmax}_{\theta \in \Theta} \mathbb{E} L(\theta). \tag{1.2}$$

$\theta^*$ is the true value in the parametric situation and can be viewed as the parameter of the best parametric fit in the general case. Spokoiny (2012) states the following extension of the prominent Fisher expansion of the qMLE $\tilde\theta$:

$$D_0(\tilde\theta - \theta^*) \approx \xi := D_0^{-1} \nabla L(\theta^*), \tag{1.3}$$

where $\nabla L(\theta) = \frac{dL}{d\theta}(\theta)$ and $D_0^2 := -\nabla^2 \mathbb{E} L(\theta^*)$ is the analog of the total Fisher information matrix. In the classical situation, the standardized score $\xi$ is asymptotically standard normal, yielding asymptotic root-n normality and efficiency of the MLE $\tilde\theta$.

Now we switch to the Bayes setup with a random element $\vartheta$ following a prior measure $\Pi$ on the parameter set $\Theta$. The posterior describes the conditional distribution of $\vartheta$ given $Y$ obtained by normalization of the product $\exp\{L(\theta)\}\,\Pi(d\theta)$. This relation is usually written as

$$\vartheta \mid Y \propto \exp\{L(\theta)\}\,\Pi(d\theta). \tag{1.4}$$

An important feature of our analysis is that $L(\theta)$ is not assumed to be the true log-likelihood. This means that a model misspecification is possible and the underlying data distribution can be beyond the considered parametric family. In this sense, the Bayes formula (1.4) describes a quasi posterior. Section 2 studies some general properties of the posterior measure focusing on the case of a non-informative prior $\Pi$. The main result claims that the distribution of $D_0(\vartheta - \theta^*) - \xi$ given $Y$ is nearly standard normal. Comparing with (1.3) indicates that the posterior is nearly centered at the qMLE $\tilde\theta$ and its variance is close to $D_0^{-2}$. So, our result extends the BvM theorem to the considered general setup. Section 2.7 comments on how the result can be extended to the case of any regular prior. The study is nonasymptotic and all "small" terms are carefully described. This helps to understand how the parameter dimension is involved and, in particular, to address the question of a critical dimension. It appears that in the special i.i.d. case with $n$ observations the presented BvM result continues to hold as long as $p^3/n$ is small.
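The claim that the posterior is nearly centered at the qMLE with variance close to $D_0^{-2}$ can be checked numerically in a toy case. The following is a minimal sketch under assumptions that are not part of the paper: a one-dimensional Poisson model $Y_i \sim \mathrm{Poisson}(e^{\theta})$ with a flat prior on $\theta$, so that $L(\theta) = S\theta - n e^{\theta}$ with $S = \sum_i Y_i$, and a plain grid quadrature. The function name and the grid parameters are hypothetical.

```python
import math

def poisson_flat_posterior_moments(n, total, grid_size=20001, width=8.0):
    """Posterior mean/variance of theta for Y_i ~ Poisson(exp(theta)), flat prior.

    Illustrative assumption: the model and the grid quadrature are not taken
    from the paper; they only mimic the quasi posterior exp{L(theta)} d theta.
    """
    mle = math.log(total / n)        # maximiser of L(theta) = total*theta - n*exp(theta)
    fisher = n * math.exp(mle)       # D_0^2 evaluated at the MLE (equals `total`)
    sd = 1.0 / math.sqrt(fisher)
    lo, hi = mle - width * sd, mle + width * sd
    step = (hi - lo) / (grid_size - 1)
    thetas = [lo + i * step for i in range(grid_size)]
    log_post = [total * t - n * math.exp(t) for t in thetas]
    shift = max(log_post)            # subtract the maximum to avoid underflow
    weights = [math.exp(v - shift) for v in log_post]
    norm = sum(weights)
    mean = sum(w * t for w, t in zip(weights, thetas)) / norm
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, thetas)) / norm
    return mle, fisher, mean, var
```

For instance, with `n = 400` and `total = 800` the standardized difference `sqrt(fisher) * (mean - mle)` and the ratio `var * fisher` can be compared with 0 and 1 respectively.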

The paper is organized as follows. Section 2 states our main results including the lower and upper bounds for integrals w.r.t. the posterior measure and the tail posterior probability. The main result is the BvM Theorem stated as Theorem 2.9 in Section 2.5. The results are illustrated by the case of an i.i.d. sample. In particular, we show that the BvM result holds if the sample size $n$ is larger in order than the cube of the parameter dimension $p$. In this case the classical asymptotic BvM statement follows immediately from our result. The conditions for our results are collected in Section A, and the proofs are collected in Section B of the Appendix.

2 Main results 

This section presents our main results which generalize the Bernstein-von Mises Theorem. First we discuss the special case of a non-informative prior given by the Lebesgue measure $\pi(\theta) \equiv 1$ on $\mathbb{R}^p$. Then we extend the results to any prior with a continuous density on a vicinity of the central point $\theta^*$. The study uses some recent results from Spokoiny (2012). Section 2.1 provides a brief overview of the bracketing and upper function devices required for our study.

2.1 Main tools. Local bracketing and upper function devices 

Introduce the notation $L(\theta,\theta^*) = L(\theta) - L(\theta^*)$ for the (quasi) log-likelihood ratio. The main step in the approach of Spokoiny (2012) is the following local bracketing result:

$$\mathbb{L}_{\underline\varepsilon}(\theta,\theta^*) - \diamondsuit_{\underline\varepsilon} \le L(\theta,\theta^*) \le \mathbb{L}_{\varepsilon}(\theta,\theta^*) + \diamondsuit_{\varepsilon}, \qquad \theta \in \Theta_0. \tag{2.1}$$

Here $\mathbb{L}_{\varepsilon}(\theta,\theta^*)$ and $\mathbb{L}_{\underline\varepsilon}(\theta,\theta^*)$ are expressions quadratic in $\theta - \theta^*$, $\diamondsuit_{\varepsilon}$ and $\diamondsuit_{\underline\varepsilon}$ are small errors, and $\Theta_0$ is a local vicinity of the central point $\theta^*$. This result can be viewed as an extension of the famous Le Cam local asymptotic normality (LAN) condition. The LAN condition considers just one quadratic process for approximation of the log-likelihood $L(\theta)$. The use of the bracketing device with two different quadratic expressions allows one to keep control of the error terms $\diamondsuit_{\varepsilon}, \diamondsuit_{\underline\varepsilon}$ even for relatively large neighborhoods $\Theta_0$ of $\theta^*$, while the LAN approach is essentially restricted to a root-n vicinity of $\theta^*$. Another benefit of the bracketing approach is that it applies to an arbitrary parameter dimension and a possible model misspecification. The bracketing bound (2.1) requires some conditions which are listed in Section A of the Appendix. Similarly to the LAN theory, the bracketing result has a number of remarkable corollaries like the Wilks and Fisher Theorems; see Spokoiny (2012). Below we show that the BvM result is in some sense also a corollary of (2.1).

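To make a precise statement, we have to specify the ingredients of the bracketing device. It involves two symmetric $p \times p$ matrices

$$D_0^2 := -\nabla^2 \mathbb{E} L(\theta^*), \qquad V_0^2 := \operatorname{Var}\{\nabla L(\theta^*)\},$$

the radius $r_0$, and two non-negative constants $\varepsilon = (\delta, \varrho)$ which are related to the conditions of Section A. The local vicinity $\Theta_0(r_0)$ of the central point $\theta^*$ is defined as $\Theta_0(r_0) = \{\theta : \|V_0(\theta - \theta^*)\| \le r_0\}$. The bracketing result reads as follows:

$$\mathbb{L}_{\underline\varepsilon}(\theta,\theta^*) - \diamondsuit_{\underline\varepsilon}(r_0) \le L(\theta,\theta^*) \le \mathbb{L}_{\varepsilon}(\theta,\theta^*) + \diamondsuit_{\varepsilon}(r_0), \qquad \theta \in \Theta_0(r_0), \tag{2.2}$$

where

$$\mathbb{L}_{\varepsilon}(\theta,\theta^*) := (\theta - \theta^*)^\top \nabla L(\theta^*) - \|D_\varepsilon(\theta - \theta^*)\|^2/2 = \xi_\varepsilon^\top D_\varepsilon(\theta - \theta^*) - \|D_\varepsilon(\theta - \theta^*)\|^2/2$$

with

$$D_\varepsilon^2 := D_0^2(1 - \delta) - \varrho V_0^2, \qquad \xi_\varepsilon := D_\varepsilon^{-1} \nabla L(\theta^*),$$

and similarly for $\underline\varepsilon = -\varepsilon = (-\delta, -\varrho)$. The error terms $\diamondsuit_{\varepsilon}(r_0)$ and $\diamondsuit_{\underline\varepsilon}(r_0)$ follow the probability bound given in Proposition 3.7 of Spokoiny (2012). The bracketing bound (2.2) becomes useful if these errors are relatively small and can be neglected.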

The local bracketing result (2.2) has to be accompanied by a large deviation bound which ensures a small probability of the event $\{\tilde\theta \notin \Theta_0(r_0)\}$. Spokoiny (2012) explains how such a bound can be obtained by the upper function approach. An upper function $u(\theta)$ ensures (with a high probability) that $L(\theta,\theta^*) \le -u(\theta)$ uniformly in $\theta \in \Theta \setminus \Theta_0(r_0)$. More precisely, given $x > 0$, there exist a random set $\Omega(x)$ with $\mathbb{P}\{\Omega(x)\} \ge 1 - e^{-x}$ and a function $u(\theta) = u(\theta, x)$ such that on $\Omega(x)$, it holds

$$L(\theta,\theta^*) \le -u(\theta), \qquad \theta \in \Theta \setminus \Theta_0(r_0). \tag{2.3}$$

The further study will only use (2.2) and (2.3). We impose the same conditions as in Spokoiny (2012), and the statements (2.2) and (2.3) are Theorem 2.1 and Theorem 4.1 of that paper. However, at one point there is an essential difference. The results of Spokoiny (2012) are stated for finite samples. Below we consider an asymptotic setup with growing dimension $p$. The asymptotic parameter is denoted by $n$ and can be viewed as the sample size with $n \to \infty$. We assume that all considered objects depend on $n$, including the likelihood function, the parameter set $\Theta$ and its dimension $p$, as well as all the constants in our conditions. This especially concerns the constants $\delta = \delta_n$ and $\varrho = \varrho_n$ which appear there. Below we assume that they are small in the sense that $\delta_n \to 0$ and $\varrho_n \to 0$ as $n \to \infty$ at a certain rate. The primary goal of our study is to fix the necessary and sufficient conditions on this rate which ensure the BvM result.

Below, $C$ denotes a generic absolute constant, not necessarily the same at each occurrence. We also suppose that a growing sequence $x_n$ with $x_n \le p_n$ is fixed. In a typical situation one can fix $x_n = C p_n$. By $\Omega_n$ we denote a random event of dominating probability such that $\mathbb{P}(\Omega_n) \ge 1 - C e^{-x_n}$.

The next theorem presents some sufficient conditions for (2.2) and (2.3); it is proved in a slightly more general form in Spokoiny (2012).

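Theorem 2.1. Let $p = p_n \to \infty$. Suppose the conditions $(ED_0)$, $(ED_1)$, $(\mathcal{L}_0)$, and $(\mathcal{I})$ from Section A with $r_0^2 \ge C p$ for a fixed constant $C$. Let also the constants $\varepsilon = (\delta, \varrho)$ fulfill $\varrho \ge 3 \nu_0\, \omega(r)$ and $\delta \ge \delta(r)$. Then (2.2) holds in which the error terms $\diamondsuit_{\varepsilon}(r_0)$ and $\diamondsuit_{\underline\varepsilon}(r_0)$ fulfill on a random set $\Omega_n$ of dominating probability

$$\diamondsuit_{\varepsilon}(r_0) \le C \varrho\, p, \qquad \diamondsuit_{\underline\varepsilon}(r_0) \le C \varrho\, p. \tag{2.4}$$

Moreover, the random vector $\xi_\varepsilon$ fulfills on $\Omega_n$

$$\|\xi_\varepsilon\|^2 \le C p. \tag{2.5}$$

Furthermore, assume $(Er)$ and $(\mathcal{L}r)$ with $b(r) \equiv b$ yielding

$$-\mathbb{E} L(\theta,\theta^*) \ge b\, \|V_0(\theta - \theta^*)\|^2$$

for each $\theta \in \Theta \setminus \Theta_0(r_0)$. Let also $r_0^2 \ge C p / b^2$ and $g(r) \ge C b$ for all $r \ge r_0$; see $(Er)$. Then (2.3) holds on a random set $\Omega_n$ with $u(\theta) = b\, \|V_0(\theta - \theta^*)\|^2/2$.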

In what follows we will assume that the conditions of this theorem are fulfilled and will use the statements (2.2) and (2.3) and the bounds (2.4) and (2.5). This is all that we need for our BvM study. Define also

$$\tau_\varepsilon := \bigl\| I_p - D_0^{-1} D_\varepsilon^2 D_0^{-1} \bigr\|_\infty = \bigl\| \delta I_p + \varrho\, D_0^{-1} V_0^2 D_0^{-1} \bigr\|_\infty .$$

Here $\|A\|_\infty$ means the spectral norm or, equivalently, the maximal eigenvalue of a symmetric matrix $A \ge 0$. It holds under the local identifiability condition $D_0^2 \ge a^{-2} V_0^2$; see $(\mathcal{I})$ of Section A:

$$\tau_\varepsilon \le \delta + \varrho\, a^2 .$$

So, $\tau_\varepsilon$ is small if $\delta, \varrho$ are small.
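The bound $\tau_\varepsilon \le \delta + \varrho a^2$ is easy to verify when the matrices are diagonal. The sketch below is only an illustration under stated assumptions: $\tau_\varepsilon$ is taken to be the spectral norm of $I_p - D_0^{-1} D_\varepsilon^2 D_0^{-1}$ with $D_\varepsilon^2 = D_0^2(1-\delta) - \varrho V_0^2$, both $D_0^2$ and $V_0^2$ are assumed diagonal (passed as lists of diagonal entries), and the function names are hypothetical.

```python
def tau_eps_diagonal(d0_sq, v0_sq, delta, rho):
    """Spectral norm of I - D0^{-1} D_eps^2 D0^{-1} for diagonal D0^2 and V0^2.

    Assumption: D_eps^2 = D0^2 (1 - delta) - rho * V0^2, so the i-th diagonal
    entry of I - D0^{-1} D_eps^2 D0^{-1} equals delta + rho * v_i / d_i.
    """
    return max(delta + rho * v / d for d, v in zip(d0_sq, v0_sq))

def identifiability_constant_sq(d0_sq, v0_sq):
    """Smallest a^2 with a^2 * D0^2 >= V0^2 in the diagonal case."""
    return max(v / d for d, v in zip(d0_sq, v0_sq))
```

In the diagonal case the bound is attained, since both quantities are governed by the largest ratio $v_i / d_i$.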

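The results of Theorem 2.1 yield a number of important corollaries. We present one of them concerning an expansion of the qMLE $\tilde\theta$; see Corollary 3.4 in Spokoiny (2012).

Corollary 2.2. Suppose the conditions of Theorem 2.1. Then it holds on $\Omega_n$

$$\|D_\varepsilon(\tilde\theta - \theta^*) - \xi_\varepsilon\|^2 \le 2 \Delta_\varepsilon(r_0), \tag{2.6}$$

where

$$\Delta_\varepsilon(r_0) := \diamondsuit_{\varepsilon}(r_0) + \diamondsuit_{\underline\varepsilon}(r_0) + \bigl(\|\xi_\varepsilon\|^2 - \|\xi_{\underline\varepsilon}\|^2\bigr)/2 \tag{2.7}$$

satisfies

$$\Delta_\varepsilon(r_0) \le C \tau_\varepsilon\, p .$$

The value $\Delta_\varepsilon(r_0)$, called the spread, can be viewed as the error induced by the bracketing device. The Wilks and Fisher results from Spokoiny (2012) apply for growing $p$ under the condition "$p^{-1} \Delta_\varepsilon(r_0)$ is small". Indeed, the quadratic form $\|\xi_\varepsilon\|^2$ is of order $p$ and the expansion (2.6) is meaningful if $\Delta_\varepsilon / p \asymp \tau_\varepsilon$ is small.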

2.2 Local Gaussian approximation of the posterior. Upper bound 

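Our aim is to show that the distribution of $D_\varepsilon(\vartheta - \theta_\varepsilon)$ given $Y$ is nearly standard normal, where

$$\theta_\varepsilon := \theta^* + D_\varepsilon^{-1} \xi_\varepsilon$$

and $\xi_\varepsilon = D_\varepsilon^{-1} \nabla L(\theta^*)$. Together with the expansion $D_\varepsilon(\tilde\theta - \theta^*) - \xi_\varepsilon \approx 0$ this yields that $D_\varepsilon(\tilde\theta - \theta_\varepsilon) \approx 0$, and the BvM result claims that $D_\varepsilon(\vartheta - \tilde\theta)$ is nearly standard normal given $Y$.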

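For any nonnegative function $f$, the bracketing bound (2.2) yields:

$$\int_{\Theta_0(r_0)} \exp\{L(\theta,\theta^*)\}\, f(D_\varepsilon(\theta - \theta_\varepsilon))\, d\theta \le e^{\diamondsuit_{\varepsilon}(r_0)} \int_{\Theta_0(r_0)} \exp\{\mathbb{L}_{\varepsilon}(\theta,\theta^*)\}\, f(D_\varepsilon(\theta - \theta_\varepsilon))\, d\theta .$$

Similarly, with $\theta_{\underline\varepsilon} = \theta^* + D_{\underline\varepsilon}^{-1} \xi_{\underline\varepsilon}$,

$$\int_{\Theta_0(r_0)} \exp\{L(\theta,\theta^*)\}\, f(D_{\underline\varepsilon}(\theta - \theta_{\underline\varepsilon}))\, d\theta \ge e^{-\diamondsuit_{\underline\varepsilon}(r_0)} \int_{\Theta_0(r_0)} \exp\{\mathbb{L}_{\underline\varepsilon}(\theta,\theta^*)\}\, f(D_{\underline\varepsilon}(\theta - \theta_{\underline\varepsilon}))\, d\theta .$$

The main benefit of these bounds is that both $\mathbb{L}_{\varepsilon}(\theta,\theta^*)$ and $\mathbb{L}_{\underline\varepsilon}(\theta,\theta^*)$ are quadratic in $\theta$. This enables us to explicitly evaluate the posterior and to show that the posterior measure is nearly Gaussian. In what follows $\gamma$ is a standard normal vector in $\mathbb{R}^p$.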
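The reason a quadratic bracket leads to a Gaussian measure can be spelled out by completing the square. The following worked step uses only the definitions $\mathbb{L}_\varepsilon(\theta,\theta^*) = \xi_\varepsilon^\top D_\varepsilon(\theta-\theta^*) - \|D_\varepsilon(\theta-\theta^*)\|^2/2$ and $\theta_\varepsilon = \theta^* + D_\varepsilon^{-1}\xi_\varepsilon$; the auxiliary variable $u$ is introduced here for illustration only.

```latex
\begin{aligned}
\mathbb{L}_\varepsilon(\theta,\theta^*)
  &= \xi_\varepsilon^\top u - \|u\|^2/2,
     \qquad u := D_\varepsilon(\theta-\theta^*), \\
  &= \|\xi_\varepsilon\|^2/2 - \|u - \xi_\varepsilon\|^2/2 \\
  &= \|\xi_\varepsilon\|^2/2 - \|D_\varepsilon(\theta-\theta_\varepsilon)\|^2/2 .
\end{aligned}
```

Hence $\exp\{\mathbb{L}_\varepsilon(\theta,\theta^*)\}$ is, up to a factor not depending on $\theta$, the density of the normal law with mean $\theta_\varepsilon$ and covariance $D_\varepsilon^{-2}$.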

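Theorem 2.3. Suppose the conditions of Theorem 2.1. Then for any nonnegative function $f(\cdot)$ on $\mathbb{R}^p$, it holds on $\Omega_n$

$$\mathbb{E}\bigl[ f(D_\varepsilon(\vartheta - \theta_\varepsilon))\, 1\{\vartheta \in \Theta_0(r_0)\} \bigm| Y \bigr] \le \exp\{\Delta^{+}(r_0)\}\, \mathbb{E} f(\gamma), \tag{2.8}$$

where

$$\Delta^{+}(r_0) := \Delta_\varepsilon(r_0) + \log\{\det(D_{\underline\varepsilon} D_\varepsilon^{-1})\} + \nu_{\underline\varepsilon}(r_0),$$

with $\Delta_\varepsilon(r_0)$ from (2.7), and for $\gamma \sim N(0, I_p)$

$$\nu_\varepsilon(r_0) := -\log \mathbb{P}\bigl( \|V_0 D_\varepsilon^{-1}(\gamma + \xi_\varepsilon)\| \le r_0 \bigm| Y \bigr). \tag{2.9}$$

Furthermore, on $\Omega_n$, it holds $\Delta^{+}(r_0) \le C \tau_\varepsilon\, p$.

The condition $\tau_\varepsilon\, p \to 0$ allows us to ignore the exp-factor in (2.8), and this result yields an upper bound $\mathbb{E} f(\gamma)$ for the posterior expectation of $f(D_\varepsilon(\vartheta - \theta_\varepsilon))$ conditioned on $Y$ and on $\vartheta \in \Theta_0(r_0)$. The lower bound requires some additional results on the tails of the posterior.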

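Later we use a special case of this bound with $f(u) = \exp(\lambda^\top u)$ for a fixed vector $\lambda \in \mathbb{R}^p$. It is convenient to introduce a local conditional expectation: for a random variable $\eta$, define

$$\mathbb{E}^{\circ} \eta := \mathbb{E}\bigl[ \eta\, 1\{\vartheta \in \Theta_0(r_0)\} \bigm| Y \bigr].$$

The bound (2.8) reads as

$$\mathbb{E}^{\circ} f(D_\varepsilon(\vartheta - \theta_\varepsilon)) \le \exp\{\Delta^{+}(r_0)\}\, \mathbb{E} f(\gamma).$$

The next result considers the special cases $f(u) = \exp(\lambda^\top u)$ and $f(u) = |\lambda^\top u|^2$.

Corollary 2.4. For any $\lambda \in \mathbb{R}^p$

$$\log \mathbb{E}^{\circ} \exp\{\lambda^\top D_\varepsilon(\vartheta - \theta_\varepsilon)\} \le \|\lambda\|^2/2 + \Delta^{+}(r_0), \tag{2.10}$$
$$\mathbb{E}^{\circ} |\lambda^\top D_\varepsilon(\vartheta - \theta_\varepsilon)|^2 \le \|\lambda\|^2 \exp \Delta^{+}(r_0).$$

Moreover, if $\|\lambda\|^2 \le p$, then

$$\log \mathbb{E}^{\circ} \exp\{\lambda^\top D_{\underline\varepsilon}(\vartheta - \theta_{\underline\varepsilon})\} \le \|\lambda\|^2/2 + \Delta^{\oplus}(r_0), \tag{2.11}$$
$$\mathbb{E}^{\circ} |\lambda^\top D_{\underline\varepsilon}(\vartheta - \theta_{\underline\varepsilon})|^2 \le \|\lambda\|^2 \exp \Delta^{\oplus}(r_0),$$

with

$$\Delta^{\oplus}(r_0) := \Delta^{+}(r_0) + \tau_\varepsilon\, p/2 + 2 \tau_\varepsilon\, p^{1/2} \|\xi_{\underline\varepsilon}\| .$$

On $\Omega_n$ one obtains $\Delta^{\oplus}(r_0) \le C \tau_\varepsilon\, p$.

Similarly to the previous result, the condition $\tau_\varepsilon\, p \to 0$ ensures that the terms $\Delta^{+}(r_0)$ in (2.10) and $\Delta^{\oplus}(r_0)$ in (2.11) are negligible.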

2.3 Tail posterior probability and contraction 

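The next major step in our analysis is to check that $\vartheta$ concentrates in a small vicinity $\Theta_0 = \Theta_0(r_0)$ of the central point $\theta^*$ with a properly selected $r_0$. The concentration properties of the posterior will be described by using the random quantity

$$\rho(r_0) := \frac{\int_{\Theta \setminus \Theta_0(r_0)} \exp\{L(\theta,\theta^*)\}\, d\theta}{\int_{\Theta_0(r_0)} \exp\{L(\theta,\theta^*)\}\, d\theta} . \tag{2.12}$$

Obviously $\mathbb{P}\{\vartheta \notin \Theta_0(r_0) \mid Y\} \le \rho(r_0)$. Therefore, small values of $\rho(r_0)$ indicate a small posterior probability of the set $\Theta \setminus \Theta_0$.

Theorem 2.5. Suppose the conditions of Theorem 2.1. Then it holds on $\Omega_n$

$$\rho(r_0) \le \exp\{\diamondsuit_{\underline\varepsilon}(r_0) + \nu_{\underline\varepsilon}(r_0)\}\; b^{-p/2} \det(D_{\underline\varepsilon} V_0^{-1})\; \mathbb{P}\bigl(\|\gamma\|^2 \ge b\, r_0^2\bigr),$$

with $\nu_{\underline\varepsilon}(r_0)$ from (2.9). Similarly, for each $m \ge 0$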

p^{r,) 1E\\\DJ,{> - e,)r 0o(ro)} 1 1^] 

< exp{0^(ro) + .,(ro)} ^^^[Hir 2(117^ > brg)] . 

This result yields simple sufficient conditions on the value $r_0$ which ensure the concentration of the posterior on $\Theta_0(r_0)$.

Corollary 2.6. Assume the conditions of Theorem 2.5. Let, in addition, $V_0^2 \ge a_1^2 D_0^2$. There exists $C = C(a_1)$ depending on $a_1$ only, such that the inequality $b\, r_0^2 \ge C p$ ensures

$$\rho_m(r_0) \le C e^{-x_n}, \qquad m = 0, 1, 2. \tag{2.13}$$
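The Gaussian tail probability $\mathbb{P}(\|\gamma\|^2 \ge b r_0^2)$ behind this corollary can be evaluated exactly when $p$ is even, which gives a feeling for how large the constant in "$b r_0^2 \ge C p$" has to be. The sketch below is illustrative and not part of the paper: it uses the standard identity $\mathbb{P}(\chi^2_{2k} \ge z) = e^{-z/2} \sum_{j<k} (z/2)^j / j!$ and the well-known Laurent-Massart deviation level $p + 2\sqrt{p x} + 2x$, whose tail is at most $e^{-x}$; the function names are hypothetical.

```python
import math

def chi2_survival_even(dof, z):
    """Exact P(chi^2_dof >= z) for even dof, via the Poisson-sum identity."""
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    half = z / 2.0
    term, total = 1.0, 1.0              # j = 0 term of sum_{j < dof/2} half^j / j!
    for j in range(1, dof // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def laurent_massart_threshold(dof, x):
    """Level dof + 2*sqrt(dof*x) + 2*x whose chi-square tail is at most exp(-x)."""
    return dof + 2.0 * math.sqrt(dof * x) + 2.0 * x
```

With $x = p$ the threshold equals $5p$, so in this illustration $b r_0^2 \ge 5p$ already forces the tail below $e^{-p}$.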

2.4 Local Gaussian approximation of the posterior. Lower bound 

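Now we present a local lower bound for the posterior probability. The reason for separating the upper and lower bounds is that the lower bound also requires a tail probability estimation; see (2.13).

Theorem 2.7. Suppose the conditions of Theorem 2.1. Then for any nonnegative function $f(\cdot)$ on $\mathbb{R}^p$, it holds on $\Omega_n$

$$\mathbb{E}^{\circ} f(D_{\underline\varepsilon}(\vartheta - \theta_{\underline\varepsilon})) \ge \exp\{-\Delta^{-}(r_0)\}\, \mathbb{E}\bigl\{ f(\gamma)\, 1(\|\gamma\| \le C r_0) \bigr\}, \tag{2.14}$$

where

$$\Delta^{-}(r_0) := \Delta_\varepsilon(r_0) + \log\{\det(D_{\underline\varepsilon} D_\varepsilon^{-1})\} + \rho(r_0).$$

On $\Omega_n$ one obtains $\Delta^{-}(r_0) \le C \tau_\varepsilon\, p$.

As a corollary, we state the result for the moment generating function of $D_{\underline\varepsilon}(\vartheta - \theta_{\underline\varepsilon})$ corresponding to an exponential function $f$. Here we assume $p = p_n$ large and $x_n \le p_n$. We also need the additional condition that $r_0^2 \ge C p$ for $C$ sufficiently large.

Corollary 2.8. Let $r_0^2 \ge C p$. For any $\lambda \in \mathbb{R}^p$ with $\|\lambda\|^2 \le p$

$$\log \mathbb{E}^{\circ} \exp\{\lambda^\top D_{\underline\varepsilon}(\vartheta - \theta_{\underline\varepsilon})\} \ge \|\lambda\|^2/2 - \Delta^{\ominus}(r_0), \qquad \Delta^{\ominus}(r_0) := \Delta^{-}(r_0) + e^{-x_n} .$$

On $\Omega_n$ it holds $\Delta^{\ominus}(r_0) \le C \tau_\varepsilon\, p$.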

2.5 Bernstein - von Mises Theorem 

An important feature of the posterior distribution is that it is entirely known and can be numerically assessed. If we know in addition that the posterior is nearly normal, it suffices to compute its mean and variance for building the concentration and credible sets. Define

$$\bar\vartheta := \mathbb{E}(\vartheta \mid Y), \qquad \mathfrak{S}^2 := \operatorname{Cov}(\vartheta) := \mathbb{E}\bigl\{ (\vartheta - \bar\vartheta)(\vartheta - \bar\vartheta)^\top \bigm| Y \bigr\} .$$

Both quantities are data dependent. This section presents a version of the BvM result in the considered nonasymptotic setup which claims that $\bar\vartheta$ is close to the MLE $\tilde\theta$, $\mathfrak{S}^2$ is nearly equal to $D_0^{-2}$, and $\mathfrak{S}^{-1}(\vartheta - \bar\vartheta)$ is nearly standard normal conditionally on $Y$. The results are entirely based on the statements of Corollaries 2.4 and 2.8. Below $\Delta^{*} := \max\{\Delta^{\oplus}, \Delta^{\ominus}\}$. As previously, we assume that $p = p_n \to \infty$ but $\tau_\varepsilon\, p_n \to 0$. We already know that $\Delta^{*} \le C \tau_\varepsilon\, p$ on a set of dominating probability.

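Theorem 2.9. Suppose the conditions of Theorem 2.1. It holds on $\Omega_n$

$$\|D_0(\bar\vartheta - \tilde\theta)\|^2 \le C \Delta^{*} \le C \tau_\varepsilon\, p,$$
$$\|I_p - D_0 \mathfrak{S}^2 D_0\|_\infty \le C \Delta^{*} \le C \tau_\varepsilon\, p .$$

Moreover, for any $\lambda \in \mathbb{R}^p$ with $\|\lambda\|^2 \le p$

$$\Bigl| \log \mathbb{E}\bigl[ \exp\{\lambda^\top \mathfrak{S}^{-1}(\vartheta - \bar\vartheta)\} \bigm| Y \bigr] - \|\lambda\|^2/2 \Bigr| \le C \Delta^{*} \le C \tau_\varepsilon\, p . \tag{2.15}$$

We conclude that all our results, including the BvM Theorem, require that "$\tau_\varepsilon\, p$ is small".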



2.6 Bayesian credible sets and frequentist confidence sets 

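This section discusses the possibility of using the Bayesian credible sets as frequentist confidence sets. In general, the Bayesian credible sets are defined as level sets of the posterior density. The BvM result suggests building such sets in the elliptic form:

$$\mathcal{C}(\mathfrak{z}) := \bigl\{ \theta : \|\mathfrak{S}^{-1}(\theta - \bar\vartheta)\|^2 \le \mathfrak{z} \bigr\} .$$

This set is random via the random posterior mean $\bar\vartheta$ and posterior covariance $\mathfrak{S}^2$. If $\mathfrak{z}$ is a proper quantile of the chi-squared distribution $\chi^2_p$, e.g. $\mathbb{P}(\chi^2_p > \mathfrak{z}_\alpha) = \alpha$, then the posterior mass of the set $\mathcal{C}(\mathfrak{z}_\alpha)$ is about $1 - \alpha$. Such credible sets are often considered as the Bayesian counterpart of the frequentist confidence sets. This raises a natural question about the coverage probability $\mathbb{P}(\theta^* \in \mathcal{C}(\mathfrak{z}))$ of the set $\mathcal{C}(\mathfrak{z})$. The result of Theorem 2.9 tells us that in the definition of this set one can replace $\bar\vartheta$ with $\theta_\varepsilon = \theta^* + D_\varepsilon^{-1} \xi_\varepsilon$ and $\mathfrak{S}^{-1}$ with $D_\varepsilon$. After this replacement, $\theta^* \in \mathcal{C}(\mathfrak{z}_\alpha)$ is equivalent to $\|\xi_\varepsilon\|^2 \le \mathfrak{z}_\alpha$. This means that $\mathcal{C}(\mathfrak{z}_\alpha)$ covers $\theta^*$ with probability about $1 - \alpha$ provided that $\mathbb{P}(\|\xi_\varepsilon\|^2 \le \mathfrak{z}_\alpha) \approx 1 - \alpha$. In the classical asymptotic setup with a correctly specified parametric model, $\xi_\varepsilon$ is nearly standard normal, and the set $\mathcal{C}(\mathfrak{z}_\alpha)$ can indeed be used as an asymptotic $\alpha$-confidence set. Moreover, Spokoiny (2012) argued that even for finite samples, the deviation probability $\mathbb{P}(\|\xi_\varepsilon\|^2 > p + x)$ of the quadratic form $\|\xi_\varepsilon\|^2$ is very close to the analogous deviation probability for a Gaussian quadratic form. This justifies the use of $\mathcal{C}(\mathfrak{z})$ as a frequentist confidence set.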
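The elliptic construction can be sketched in code for the two-dimensional case, where the chi-squared quantile has the closed form $\mathfrak{z}_\alpha = -2 \log \alpha$. This is a minimal illustration and not the author's procedure: the restriction to $p = 2$, the function names, and the representation of the posterior covariance as a nested tuple are all assumptions made here for brevity.

```python
import math

def chi2_quantile_2d(alpha):
    """z_alpha with P(chi^2_2 > z_alpha) = alpha (closed form valid for p = 2 only)."""
    return -2.0 * math.log(alpha)

def in_credible_set_2d(theta, post_mean, post_cov, alpha):
    """Check whether theta lies in {t : (t - m)^T S^{-2} (t - m) <= z_alpha}, p = 2.

    post_cov is the 2x2 posterior covariance S^2 given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = post_cov
    det = a * d - b * c
    dx, dy = theta[0] - post_mean[0], theta[1] - post_mean[1]
    # quadratic form with the explicit inverse of the 2x2 covariance matrix
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return quad <= chi2_quantile_2d(alpha)
```

In practice the posterior mean and covariance would come from Bayes simulations; here they are simply passed in as numbers.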

The situation changes if the parametric model is misspecified. This can lead to two different problems. First, the matrices $D_0^2$ and $V_0^2$ can differ from each other. Then the covariance matrix of the vector $\xi$ is close to $D_0^{-1} V_0^2 D_0^{-1}$; see Spokoiny (2012). The deviation probability $\mathbb{P}(\|\xi_\varepsilon\|^2 > p + x)$ also depends on this matrix $D_0^{-1} V_0^2 D_0^{-1}$, and it is nearly the probability $\mathbb{P}(\|D_0^{-1} V_0 \gamma\|^2 > p + x)$ with a standard normal vector $\gamma \sim N(0, I_p)$. This yields that the coverage probability of the set $\mathcal{C}(\mathfrak{z})$ is driven by the maximum of two Gaussian probabilities $\mathbb{P}(\|D_0^{-1} V_0 \gamma\|^2 > \mathfrak{z})$ and $\mathbb{P}(\|\gamma\|^2 > \mathfrak{z})$.
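The effect of $D_0^2 \ne V_0^2$ on the coverage can be made explicit in a toy case. The sketch below assumes, purely for illustration, that $p = 2$ and that $D_0^{-1} V_0 = s\, I_2$ is isotropic, so that the first of the two Gaussian probabilities has a closed form; the function name and the parameter `s_sq` (standing for $s^2$) are hypothetical.

```python
import math

def coverage_isotropic_2d(alpha, s_sq):
    """P(||s * gamma||^2 <= z_alpha) for gamma ~ N(0, I_2) and D0^{-1} V0 = s * I_2.

    Illustrative assumption: isotropic sandwich factor in dimension two.
    """
    z_alpha = -2.0 * math.log(alpha)      # nominal chi^2_2 quantile
    # ||gamma||^2 ~ chi^2_2, hence P(s^2 ||gamma||^2 <= z) = 1 - exp(-z / (2 s^2))
    return 1.0 - math.exp(-z_alpha / (2.0 * s_sq))
```

The value equals $1 - \alpha^{1/s^2}$: it matches the nominal level for $s^2 = 1$, drops below it when the score is more dispersed than the Fisher information suggests ($s^2 > 1$), and exceeds it when $s^2 < 1$.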

Another issue related to model misspecification is a possible modeling bias. The target $\theta^*$ is defined as the best parametric fit via the optimization problem $\theta^* = \operatorname{argmax}_{\theta} \mathbb{E} L(\theta)$. This value can differ from the underlying model parameter if $\mathbb{P} \notin (\mathbb{P}_\theta)$. The modeling bias can naturally be measured by $\mathbb{E} \log(d\mathbb{P} / d\mathbb{P}_{\theta^*})$. If this value significantly exceeds the dimension $p$, the bias-variance trade-off is destroyed and the true parameter would lie outside the Bayesian credible set with a high probability.

2.7 Extension to a continuous prior 

The previous results for a non-informative prior can be extended to the case of a general prior $\Pi(d\theta)$ with a density $\pi(\theta)$ which is uniformly continuous on the local set $\Theta_0(r_0)$. More precisely, let $\pi(\theta)$ satisfy

$$\sup_{\theta \in \Theta_0(r_0)} \left| \frac{\pi(\theta)}{\pi(\theta^*)} - 1 \right| \le \alpha, \qquad \sup_{\theta \in \Theta} \frac{\pi(\theta)}{\pi(\theta^*)} \le C, \tag{2.16}$$

where $\alpha$ is a small constant while $C$ is any fixed constant. Then the results of Theorems 2.3 through 2.9 continue to apply with an obvious correction of the error $\Delta^{*}$.

As an example, consider the case of a Gaussian prior $\Pi = N(0, G^{-2})$ with the density $\pi(\theta) \propto \exp\{-\|G\theta\|^2/2\}$. In addition, suppose that the value $\|G\theta^*\|$ is bounded by a fixed constant. Then

$$\log \frac{\pi(\theta)}{\pi(\theta^*)} = -\|G\theta\|^2/2 + \|G\theta^*\|^2/2 = -(\theta - \theta^*)^\top G^2 \theta^* - \|G(\theta - \theta^*)\|^2/2,$$

and the condition (2.16) is fulfilled if $\|G(\theta - \theta^*)\|$ is a small number for all $\theta \in \Theta_0(r_0)$. The non-informative prior can be viewed as a limiting case of a Gaussian prior as $G \to 0$. We are interested in quantifying this relation. How small should $G$ be to ensure the BvM result?

It is obvious from the definition of $\Theta_0(r_0)$ that

$$\|G(\theta - \theta^*)\| = \|G V_0^{-1} V_0(\theta - \theta^*)\| \le \|G V_0^{-1}\|_\infty\, r_0 .$$

Similarly

$$|(\theta - \theta^*)^\top G^2 \theta^*| \le \|G\theta^*\| \cdot \|G(\theta - \theta^*)\| \le \|G\theta^*\| \cdot \|G V_0^{-1}\|_\infty\, r_0 .$$

Therefore, (2.16) effectively requires that $\|G V_0^{-1}\|_\infty\, r_0$ is small. A proper choice of $r_0$ is given by $r_0^2 = C p$, yielding the rule "$\|G V_0^{-1}\|_\infty\, p^{1/2}$ is small".

Theorem 2.10. Suppose the conditions of Theorem 2.1. Let also $\Pi = N(0, G^{-2})$ be a Gaussian prior measure on $\mathbb{R}^p$ such that

$$\|G\theta^*\| \le C, \qquad G^2 \le C \tau_\varepsilon V_0^2,$$

and $\tau_\varepsilon\, p$ is small. Then the BvM result (2.15) of Theorem 2.9 holds.
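The rule "$\|G V_0^{-1}\|_\infty\, p^{1/2}$ is small" can be turned into a quick diagnostic. The sketch below is an illustration under assumptions not made in the paper: both $G^2$ and $V_0^2$ are taken diagonal (passed as lists of diagonal entries), and the function name is hypothetical.

```python
import math

def prior_flatness(g_sq, v0_sq, p):
    """||G V0^{-1}||_inf * sqrt(p) for diagonal G^2 and V0^2 (illustrative only).

    For diagonal matrices the spectral norm of G V0^{-1} is the largest
    ratio sqrt(g_i / v_i) of the diagonal entries.
    """
    op_norm = math.sqrt(max(g / v for g, v in zip(g_sq, v0_sq)))
    return op_norm * math.sqrt(p)
```

For example, a unit-variance Gaussian prior ($G^2 = I_p$) combined with $V_0^2 = 400\, I_p$ and $p = 4$ gives the value $0.1$; in the i.i.d. case $V_0^2$ grows linearly with $n$, so the diagnostic decays like $(p/n)^{1/2}$ for a fixed prior.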

2.8 Applications to the i.i.d. case 

This section comments on how the previously obtained general results can be linked to the
classical asymptotic results in the statistical literature. The nice feature of the whole 
approach based on the local bracketing is that all the results are stated under the same 
list of conditions: once checked one can directly apply any of the mentioned results. 
Spokoiny (2012) considered three typical examples: i.i.d., GLM, and median regression 
models. Section 5 of that paper presents some mild sufficient conditions ensuring the 
general Wilks and Fisher results. Here we briefly discuss how the BvM result can be 
applied to one typical case, namely, to an i.i.d. experiment. 

Let $Y = (Y_1, \ldots, Y_n)^\top$ be an i.i.d. sample from a measure $P$. Here we suppose the conditions of Section 5.1 in Spokoiny (2012) on $P$ and $(P_\theta)$ to be fulfilled. We admit that the parametric assumption $P \in (P_\theta, \theta \in \Theta)$ can be misspecified and consider the asymptotic setup with $n$ growing to infinity and simultaneously $p = p_n$ growing to infinity. The bracketing bound and the large deviation result apply if the sample size $n$ fulfills $n \ge C p_n$ for a fixed constant $C$. It appears that the BvM result requires a stronger condition. Indeed, in the regular i.i.d. case it holds $\delta(r_0) \asymp r_0/\sqrt{n}$ and $\varrho(r_0) \asymp r_0/\sqrt{n}$. The radius $r_0$ should fulfill $r_0^2 \ge C p_n$ to ensure the large deviation result. This yields

$$\tau_\varepsilon \ge C\{\delta(r_0) + \varrho(r_0)\} \ge C \sqrt{p_n / n} .$$

The BvM requires the condition "$\tau_\varepsilon\, p_n$ is small", which effectively means that $p_n^3/n \to 0$ with $n$.
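The critical-dimension arithmetic can be illustrated by a tiny helper. With $\tau_\varepsilon \asymp \sqrt{p/n}$ the quantity $\tau_\varepsilon\, p$ is of order $\sqrt{p^3/n}$, and for $p = n^{\gamma}$ this equals $n^{(3\gamma - 1)/2}$. The sketch below sets all constants to one, which is an assumption made only for illustration; the function name is hypothetical.

```python
def bvm_rate(n, gamma):
    """sqrt(p^3 / n) for p = n**gamma, the i.i.d. order of tau_eps * p (constants dropped)."""
    p = n ** gamma
    return (p ** 3 / n) ** 0.5
```

The rate decreases in $n$ for $\gamma < 1/3$ and increases for $\gamma > 1/3$, in agreement with the condition "$p^3/n$ is small".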

Theorem 2.11. Suppose the conditions of Theorem 5.1 in Spokoiny (2012). Let also $p_n^3/n \to 0$. Then the result of Theorem 2.9 holds with $D_0^2 = n F_{\theta^*}$, where $F_{\theta^*}$ is the Fisher information of $(P_\theta)$ at $\theta^*$.

A Conditions 

Below we collect the list of conditions which are systematically used in the text. The list is essentially the same as in Spokoiny (2012), and Section 5 in that paper explains how the conditions can be checked for i.i.d. and generalized linear models, as well as for median regression. The whole list can be split into local and global conditions.

A.1 Local conditions

Local conditions describe the properties of $L(\theta)$ in a vicinity of the central point $\theta^*$ from (1.2). Define the stochastic component $\zeta(\theta)$ of $L(\theta)$:

$$\zeta(\theta) := L(\theta) - \mathbb{E} L(\theta) .$$

Below we suppose that the random function $\zeta(\theta)$ is differentiable in $\theta$ and its gradient $\nabla\zeta(\theta) = \partial\zeta(\theta)/\partial\theta \in \mathbb{R}^p$ has some exponential moments. Our first condition describes the property of the gradient $\nabla\zeta(\theta^*)$ at the central point $\theta^*$.

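$(ED_0)$ There exist a positive symmetric matrix $V_0^2$, and constants $g > 0$, $\nu_0 \ge 1$ such that $\operatorname{Var}\{\nabla\zeta(\theta^*)\} \le V_0^2$ and for all $|\lambda| \le g$

$$\sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\left\{ \lambda\, \frac{\gamma^\top \nabla\zeta(\theta^*)}{\|V_0 \gamma\|} \right\} \le \nu_0^2 \lambda^2/2 .$$

In a typical situation, the matrix $V_0^2$ can be defined as the covariance matrix of the gradient vector $\nabla\zeta(\theta^*)$: $V_0^2 = \operatorname{Var}(\nabla\zeta(\theta^*)) = \operatorname{Var}(\nabla L(\theta^*))$. If $L(\theta)$ is the log-likelihood for a correctly specified model, then $\theta^*$ is the true parameter value and $V_0^2$ coincides with the total Fisher information matrix. In the i.i.d. case it is equal to $n F$, where $n$ is the sample size while $F$ is the Fisher matrix of one observation.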

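The matrix $V_0^2$ shown in this condition determines the local geometry in the vicinity of $\theta^*$. In particular, define the local elliptic neighborhoods of $\theta^*$ as

$$\Theta_0(r) := \{\theta \in \Theta : \|V_0(\theta - \theta^*)\| \le r\}. \tag{A.1}$$

The further conditions are restricted to such defined neighborhoods $\Theta_0(r)$.

$(ED_1)$ For each $r \le r_0$, there exists a constant $\omega(r) \le 1/2$ such that it holds for all $\theta \in \Theta_0(r)$

$$\sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\left\{ \lambda\, \frac{\gamma^\top \{\nabla\zeta(\theta) - \nabla\zeta(\theta^*)\}}{\omega(r)\, \|V_0 \gamma\|} \right\} \le \nu_0^2 \lambda^2/2, \qquad |\lambda| \le g .$$

Here the constant $g$ is the same as in $(ED_0)$.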

The bracketing result (2.2) requires second order smoothness of the expected log-likelihood $\mathbb{E} L(\theta)$. By definition, $L(\theta^*, \theta^*) \equiv 0$ and $\nabla \mathbb{E} L(\theta^*) = 0$ because $\theta^*$ is the extreme point of $\mathbb{E} L(\theta)$. Therefore, $-\mathbb{E} L(\theta, \theta^*)$ can be approximated by a quadratic function of $\theta - \theta^*$ in the neighborhood of $\theta^*$. The next condition quantifies this quadratic approximation from above and from below on the set $\Theta_0(r)$ from (A.1).

$(\mathcal{L}_0)$ There are a symmetric strictly positive-definite matrix $D_0^2$ and, for each $r \le r_0$, a constant $\delta(r) \le 1/2$, such that it holds on the set $\Theta_0(r)$

$$\left| \frac{-2\, \mathbb{E} L(\theta, \theta^*)}{\|D_0(\theta - \theta^*)\|^2} - 1 \right| \le \delta(r) .$$

Usually $D_0^2$ is defined as the negative Hessian of $\mathbb{E} L(\theta)$ at $\theta = \theta^*$: $D_0^2 = -\nabla^2 \mathbb{E} L(\theta^*)$. If $L(\theta, \theta^*)$ is the log-likelihood ratio and $\mathbb{P} = \mathbb{P}_{\theta^*}$ then $-\mathbb{E} L(\theta, \theta^*) = \mathbb{E}_{\theta^*} \log(d\mathbb{P}_{\theta^*} / d\mathbb{P}_\theta) = \mathcal{K}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta)$, the Kullback-Leibler divergence between $\mathbb{P}_{\theta^*}$ and $\mathbb{P}_\theta$. Then condition $(\mathcal{L}_0)$ with $D_0 = V_0$ follows from the usual regularity conditions on the family $(\mathbb{P}_\theta)$; cf. Ibragimov and Khas'minskij (1981). In the important special case of an i.i.d. model one can take $\omega(r) = \omega^* r / n^{1/2}$ and $\delta(r) = \delta^* r / n^{1/2}$ for some constants $\omega^*, \delta^*$; see Section 5 of Spokoiny (2012).

The identifiability condition relates the matrices $D_0^2$ and $V_0^2$.

$(\mathcal{I})$ There is a constant $a > 0$ such that $a^2 D_0^2 \ge V_0^2$.

A.2 Global conditions

The global conditions have to be fulfilled for all $\theta$ lying beyond $\Theta_0(r_0)$. We only impose one condition on the smoothness of the stochastic component of the process $L(\theta)$ in terms of its gradient, and one identifiability condition in terms of the expectation $\mathbb{E} L(\theta, \theta^*)$.

The first condition is similar to the local condition $(ED_0)$ and it requires some exponential moment of the gradient $\nabla\zeta(\theta)$ for all $\theta \in \Theta$. However, the constant $g$ may depend on the radius $r = \|V_0(\theta - \theta^*)\|$.

$(Er)$ For any $r$, there exists a value $g(r) > 0$ such that for all $|\lambda| \le g(r)$

$$\sup_{\theta \in \Theta_0(r)} \sup_{\gamma \in \mathbb{R}^p} \log \mathbb{E} \exp\left\{ \lambda\, \frac{\gamma^\top \nabla\zeta(\theta)}{\|V_0 \gamma\|} \right\} \le \nu_0^2 \lambda^2/2 .$$

The global identification property means that the deterministic component $\mathbb{E} L(\theta, \theta^*)$ of the log-likelihood is competitive with its variance $\operatorname{Var} L(\theta, \theta^*)$.

$(\mathcal{L}r)$ There is a function $b(r)$ such that $r\, b(r)$ monotonously increases in $r$ and for each $r \ge r_0$

$$\inf_{\theta :\, \|V_0(\theta - \theta^*)\| = r} \bigl| \mathbb{E} L(\theta, \theta^*) \bigr| \ge b(r)\, r^2 .$$
B Proofs 



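Here we collect the proofs of the main results.

B.1 Proof of Theorem 2.3

We use that $\mathbb{L}_\varepsilon(\theta, \theta^*) = \xi_\varepsilon^\top D_\varepsilon(\theta - \theta^*) - \|D_\varepsilon(\theta - \theta^*)\|^2/2$ is, up to an additive term not depending on $\theta$, the log-density of a Gaussian distribution, and similarly for $\mathbb{L}_{\underline\varepsilon}(\theta, \theta^*)$. More precisely, define

$$m_\varepsilon(\xi_\varepsilon) := -\|\xi_\varepsilon\|^2/2 + \log(\det D_\varepsilon) - p \log(\sqrt{2\pi}). \tag{B.1}$$

Then

$$m_\varepsilon(\xi_\varepsilon) + \mathbb{L}_\varepsilon(\theta, \theta^*) = -\|D_\varepsilon(\theta - \theta_\varepsilon)\|^2/2 + \log(\det D_\varepsilon) - p \log(\sqrt{2\pi}) \tag{B.2}$$

is (conditionally on $Y$) the log-density of the normal law with the mean $\theta_\varepsilon = D_\varepsilon^{-1} \xi_\varepsilon + \theta^*$ and the covariance matrix $D_\varepsilon^{-2}$. The change of variables $u = D_\varepsilon(\theta - \theta_\varepsilon)$ implies by (B.2) for any nonnegative function $f$ that

$$\begin{aligned} \int_{\Theta_0} \exp\{L(\theta, \theta^*) + m_\varepsilon(\xi_\varepsilon)\}\, f(D_\varepsilon(\theta - \theta_\varepsilon))\, d\theta &\le e^{\diamondsuit_\varepsilon(r_0)} \int_{\mathbb{R}^p} \exp\{\mathbb{L}_\varepsilon(\theta, \theta^*) + m_\varepsilon(\xi_\varepsilon)\}\, f(D_\varepsilon(\theta - \theta_\varepsilon))\, d\theta \\ &= e^{\diamondsuit_\varepsilon(r_0)} \int \phi(u)\, f(u)\, du \\ &= e^{\diamondsuit_\varepsilon(r_0)}\, \mathbb{E} f(\gamma), \end{aligned} \tag{B.3}$$

where $\phi$ denotes the standard normal density in $\mathbb{R}^p$.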

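Similarly, if $m_{\underline\varepsilon}(\xi_{\underline\varepsilon})$ is defined by (B.1) with $\underline\varepsilon$ in place of $\varepsilon$, then the value $m_{\underline\varepsilon}(\xi_{\underline\varepsilon}) + \mathbb{L}_{\underline\varepsilon}(\theta, \theta^*)$ is (conditionally on $Y$) the log-density of the normal law with the mean $\theta_{\underline\varepsilon} = D_{\underline\varepsilon}^{-1} \xi_{\underline\varepsilon} + \theta^*$ and the covariance matrix $D_{\underline\varepsilon}^{-2}$. For any nonnegative function $f$, it holds

$$\int_{\Theta_0(r_0)} \exp\{L(\theta, \theta^*)\}\, f(D_{\underline\varepsilon}(\theta - \theta_{\underline\varepsilon}))\, d\theta \ge \exp\{-\diamondsuit_{\underline\varepsilon}(r_0) - m_{\underline\varepsilon}(\xi_{\underline\varepsilon})\} \int \phi(u)\, f(u)\, 1\{D_{\underline\varepsilon}^{-1} u + \theta_{\underline\varepsilon} \in \Theta_0(r_0)\}\, du. \tag{B.4}$$

A special case of (B.4) with $f(u) \equiv 1$ implies by definition of $\nu_{\underline\varepsilon}(r_0)$:

$$\int_{\Theta_0(r_0)} \exp\{L(\theta, \theta^*)\}\, d\theta \ge \exp\{-\diamondsuit_{\underline\varepsilon}(r_0) - m_{\underline\varepsilon}(\xi_{\underline\varepsilon}) - \nu_{\underline\varepsilon}(r_0)\}. \tag{B.5}$$

Now we are prepared to finalize the proof of the theorem. (B.3) and (B.5) imply

$$\frac{\int_{\Theta_0(r_0)} \exp\{L(\theta, \theta^*)\}\, f(D_\varepsilon(\theta - \theta_\varepsilon))\, d\theta}{\int_{\Theta_0(r_0)} \exp\{L(\theta, \theta^*)\}\, d\theta} \le \exp\{\diamondsuit_\varepsilon(r_0) + \diamondsuit_{\underline\varepsilon}(r_0) + m_{\underline\varepsilon}(\xi_{\underline\varepsilon}) - m_\varepsilon(\xi_\varepsilon) + \nu_{\underline\varepsilon}(r_0)\}\, \mathbb{E} f(\gamma),$$

and (2.8) follows by definition of $m_\varepsilon(\xi_\varepsilon)$, $m_{\underline\varepsilon}(\xi_{\underline\varepsilon})$, and $\Delta_\varepsilon(r_0)$, because the normalizing integral over $\Theta$ is not smaller than the integral over $\Theta_0(r_0)$.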

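It remains to evaluate the constants $\log\det(D_{\underline{\epsilon}} D_{\epsilon}^{-1})$ and $\nu_{\underline{\epsilon}}(r_0)$ entering in the definition of $\Delta^{+}_{\epsilon}(r_0)$.
The value $\log\det(D_{\underline{\epsilon}} D_{\epsilon}^{-1})$ can be easily bounded via $\tau_{\epsilon}$:

$$\begin{aligned}
\log\det(D_{\underline{\epsilon}} D_{\epsilon}^{-1})
&= \tfrac{1}{2}\log\det(D_0^{-1} D_{\underline{\epsilon}}^{2} D_0^{-1}) - \tfrac{1}{2}\log\det(D_0^{-1} D_{\epsilon}^{2} D_0^{-1}) \\
&\le \tfrac{p}{2}\log(1+\tau_{\epsilon}) - \tfrac{p}{2}\log(1-\tau_{\epsilon}) = \tfrac{p}{2}\log\frac{1+\tau_{\epsilon}}{1-\tau_{\epsilon}} .
\end{aligned}$$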



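If $\tau_{\epsilon} < 0.5$, then this implies $\log\det(D_{\underline{\epsilon}} D_{\epsilon}^{-1}) \le 1.1\,\tau_{\epsilon}\, p$. By ($\mathcal{I}$), it holds $\|D_{\underline{\epsilon}}^{-1} V_0^{2} D_{\underline{\epsilon}}^{-1}\|_{\infty} \le \mathfrak{a}^2 (1+\tau_{\epsilon})$.
This yields for $\gamma \sim N(0, I_p)$ and $r_0^2 \ge C p$

$$\begin{aligned}
\nu_{\underline{\epsilon}}(r_0) &= -\log \mathbb{P}\bigl(\|V_0 D_{\underline{\epsilon}}^{-1}(\gamma + \xi_{\underline{\epsilon}})\| \le r_0 \,\big|\, Y\bigr) \\
&\le -\log \mathbb{P}\bigl(\|\gamma\|^2 \le p + C x_n\bigr) \le e^{-x_n}.
\end{aligned}$$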
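
The numerical constant in the log-determinant step of this proof can be checked directly. Per unit of dimension the claim is $\tfrac{1}{2}\log\frac{1+\tau}{1-\tau} \le 1.1\,\tau$ for $0 \le \tau < 0.5$. The sketch below evaluates the gap on a grid; the helper names are hypothetical and the grid size is an arbitrary choice.

```python
import math


def logdet_term(tau):
    """Per-dimension quantity (1/2) * log((1 + tau) / (1 - tau))."""
    return 0.5 * math.log((1 + tau) / (1 - tau))


def worst_gap(n_grid=5000):
    """Largest value of logdet_term(tau) - 1.1 * tau over a grid in [0, 0.5)."""
    taus = [0.5 * k / n_grid for k in range(n_grid)]
    return max(logdet_term(t) - 1.1 * t for t in taus)
```

The gap is zero at $\tau = 0$ and negative elsewhere on the grid. The constant is close to tight at the right endpoint, since $\tfrac{1}{2}\log 3 \approx 0.5493$ while $1.1 \times 0.5 = 0.55$.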

B.2 Proof of Corollary 2.4 

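The first fact is a direct implication of (2.8). For the second one we use that

$$\exp\{\lambda^{\top} D_{\underline{\epsilon}}(\vartheta - \theta_{\underline{\epsilon}})\} = \exp\{\lambda_1^{\top} D_{\epsilon}(\vartheta - \theta_{\epsilon})\}\, \exp\{\lambda^{\top} D_{\underline{\epsilon}}(\theta_{\epsilon} - \theta_{\underline{\epsilon}})\} \tag{B.6}$$

with $\lambda_1 = D_{\epsilon}^{-1} D_{\underline{\epsilon}} \lambda$. Furthermore,

$$\|\lambda_1\|^2 = \lambda^{\top} D_{\underline{\epsilon}} D_{\epsilon}^{-2} D_{\underline{\epsilon}} \lambda \le (1+\tau_{\epsilon}) \|\lambda\|^2 \le \|\lambda\|^2 + \tau_{\epsilon}\, p.$$

It remains to bound $D_{\underline{\epsilon}}(\theta_{\underline{\epsilon}} - \theta_{\epsilon})$. It holds by definition

$$\begin{aligned}
\|D_{\underline{\epsilon}}(\theta_{\underline{\epsilon}} - \theta_{\epsilon})\| &= \|\xi_{\underline{\epsilon}} - D_{\underline{\epsilon}} D_{\epsilon}^{-1} \xi_{\epsilon}\| \\
&= \|(D_{\underline{\epsilon}}^{-1} D_{\epsilon}^{2} D_{\underline{\epsilon}}^{-1} - I_p)\, D_{\underline{\epsilon}} D_{\epsilon}^{-1} \xi_{\epsilon}\| \le \tau_{\epsilon} \|D_{\underline{\epsilon}} D_{\epsilon}^{-1} \xi_{\epsilon}\| \le 2 \tau_{\epsilon} \|\xi_{\epsilon}\|.
\end{aligned}$$

This yields the result by (2.10) and (B.6) in view of $\|\lambda\|^2 \le p$.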

B.3 Proof of Theorem 2.5

Define $u(\theta) = b\, \|V_0(\theta - \theta^*)\|^2 / 2$. Then it holds

$$-\mathbb{E} L(\theta,\theta^*) - u(\theta) \ge b\, \|V_0(\theta - \theta^*)\|^2 / 2 .$$



Now we apply Corollary 4.2 of Spokoiny (2012) in which $b(r)$ is replaced by $b/2$. This
result ensures that $u(\theta)$ is an upper function for $L(\theta,\theta^*)$. Furthermore, by a change
of variables, one obtains

$$\frac{b^{p/2} \det(V_0)}{(2\pi)^{p/2}} \int_{\Theta \setminus \Theta_0} \exp\{-u(\theta)\}\, d\theta
\le \frac{b^{p/2} \det(V_0)}{(2\pi)^{p/2}} \int_{\Theta \setminus \Theta_0} \exp\{-b\, \|V_0(\theta - \theta^*)\|^2/2\}\, d\theta = \mathbb{P}\bigl(\|\gamma\|^2 \ge b\, r_0^2\bigr).$$



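For the integral in the numerator of (2.12), it holds on $\Omega(x)$ by (2.3) for any nonnegative
function $f(\cdot)$

$$\int_{\Theta \setminus \Theta_0} \exp\{L(\theta,\theta^*)\}\, f(\theta)\, d\theta \le \int_{\Theta \setminus \Theta_0} \exp\{-u(\theta)\}\, f(\theta)\, d\theta. \tag{B.7}$$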

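The bound (B.5) for the local integral $\int_{\Theta_0} \exp\{L(\theta,\theta^*)\}\, d\theta$ implies that

$$\rho(r_0) \le \exp\{\diamondsuit_{\underline{\epsilon}}(r_0) + \nu_{\underline{\epsilon}}(r_0) + m_{\underline{\epsilon}}(\xi_{\underline{\epsilon}})\} \int_{\Theta \setminus \Theta_0} \exp\{-u(\theta)\}\, d\theta.$$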

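Finally

$$\exp\{m_{\underline{\epsilon}}(\xi_{\underline{\epsilon}})\} = \exp\{-\|\xi_{\underline{\epsilon}}\|^2/2\}\, (2\pi)^{-p/2} \det(D_{\underline{\epsilon}}) \le (2\pi)^{-p/2} \det(D_{\underline{\epsilon}})$$

and the assertion follows.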

B.4 Proof of Theorem 2.7 

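On the set $\Omega(x)$, it holds by (B.3) with $f(\cdot) \equiv 1$, (2.3) and (B.7):

$$\begin{aligned}
\int \exp\{L(\theta,\theta^*)\}\, d\theta
&\le \int_{\Theta_0} \exp\{L(\theta,\theta^*)\}\, d\theta + \int_{\Theta \setminus \Theta_0} \exp\{L(\theta,\theta^*)\}\, d\theta \\
&\le \{1 + \rho(r_0)\} \int_{\Theta_0} \exp\{L(\theta,\theta^*)\}\, d\theta \\
&\le \{1 + \rho(r_0)\} \exp\{\diamondsuit_{\epsilon}(r_0) - m_{\epsilon}(\xi_{\epsilon}) + \nu_{\underline{\epsilon}}(r_0)\} \\
&\le \exp\{\diamondsuit_{\epsilon}(r_0) - m_{\epsilon}(\xi_{\epsilon}) + \nu_{\underline{\epsilon}}(r_0) + \rho(r_0)\}.
\end{aligned}$$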

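This and the bound (B.4) imply

$$\begin{aligned}
\frac{\int_{\Theta_0(r_0)} \exp\{L(\theta,\theta^*)\}\, f(D_{\underline{\epsilon}}(\theta - \theta_{\underline{\epsilon}}))\, d\theta}{\int \exp\{L(\theta,\theta^*)\}\, d\theta}
&\ge \frac{\exp\{-\diamondsuit_{\underline{\epsilon}}(r_0) - m_{\underline{\epsilon}}(\xi_{\underline{\epsilon}})\} \int \phi(u)\, f(u)\, \mathbf{1}\{u \in S_{\underline{\epsilon}}(r_0)\}\, du}{\exp\{\diamondsuit_{\epsilon}(r_0) - m_{\epsilon}(\xi_{\epsilon}) + \nu_{\underline{\epsilon}}(r_0) + \rho(r_0)\}} \\
&\ge \exp\{-\Delta^{-}_{\epsilon}(r_0)\}\; \mathbb{E}\bigl[f(\gamma)\, \mathbf{1}\{\gamma \in S_{\underline{\epsilon}}(r_0)\}\bigr]
\end{aligned}$$

where

$$\begin{aligned}
S_{\underline{\epsilon}}(r_0) &= \{u \in \mathbb{R}^p : D_{\underline{\epsilon}}^{-1} u + \theta_{\underline{\epsilon}} \in \Theta_0(r_0)\} \\
&= \{u \in \mathbb{R}^p : \|V_0 D_{\underline{\epsilon}}^{-1}(u + \xi_{\underline{\epsilon}})\| \le r_0\}.
\end{aligned}$$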

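The bounds $\|\xi_{\underline{\epsilon}}\|^2 \le C p$, $D_{\underline{\epsilon}}^{2} \le (1+\tau_{\epsilon}) D_0^{2}$, and $V_0 \le \mathfrak{a} D_0$ imply

$$S_{\underline{\epsilon}}(r_0) \supseteq \{u \in \mathbb{R}^p : \|u\| \le r_0\}.$$

This yields (2.14).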

B.5 Proof of Corollary 2.8 

The result follows from Theorem 2.7. The only important additional step is an evaluation
of the integral $\mathbb{E}\{\exp(\lambda^{\top}\gamma)\, \mathbf{1}(\|\gamma\| \le r)\}$.

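Lemma B.1. Let $\gamma \sim N(0, I_p)$, $\mu \in (0, 1)$. Then for any vector $\lambda \in \mathbb{R}^p$ with $\|\lambda\|^2 \le p$

$$\log \mathbb{E}\{\exp(\lambda^{\top}\gamma)\, \mathbf{1}(\|\gamma\| > r)\} \le -\frac{1-\mu}{2}\, r^2 + \frac{1}{2\mu}\, \|\lambda\|^2 + (p/2) \log(\mu^{-1}). \tag{B.8}$$

Proof. We use that for $\mu < 1$

$$\mathbb{E}\{\exp(\lambda^{\top}\gamma)\, \mathbf{1}(\|\gamma\| > r)\} \le e^{-(1-\mu) r^2/2}\; \mathbb{E}\exp\{\lambda^{\top}\gamma + (1-\mu)\|\gamma\|^2/2\}.$$

It holds

$$\mathbb{E}\exp\{\lambda^{\top}\gamma + (1-\mu)\|\gamma\|^2/2\} = (2\pi)^{-p/2} \int \exp\{\lambda^{\top}\gamma - \mu\|\gamma\|^2/2\}\, d\gamma = \mu^{-p/2} \exp(\mu^{-1}\|\lambda\|^2/2)$$

and (B.8) follows. □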

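Now we apply this result with $\mu = 1/2$.

Lemma B.2. Let $\gamma \sim N(0, I_p)$, and let $r^2 \ge 6(p + x)$. Then for any $\lambda \in \mathbb{R}^p$ with $\|\lambda\|^2 \le p$

$$\mathbb{E}\{\exp(\lambda^{\top}\gamma)\, \mathbf{1}(\|\gamma\| \le r)\} \ge e^{\|\lambda\|^2/2}\, (1 - e^{-3x/2}). \tag{B.9}$$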

Proof. The result (B.8) applied with $\mu = 1/2$ yields in view of $\mathbb{E}\exp(\lambda^{\top}\gamma) = e^{\|\lambda\|^2/2}$,
$\|\lambda\|^2 \le p$, and $1 + 0.5\log(2) < 3/2$ that

$$e^{-\|\lambda\|^2/2}\; \mathbb{E}\{\exp(\lambda^{\top}\gamma)\, \mathbf{1}(\|\gamma\| \le r)\} \ge 1 - \exp\bigl(-r^2/4 + p + (p/2)\log(2)\bigr) \ge 1 - \exp(-3x/2)$$

and (B.9) follows. □

The last bound yields the assertion of Corollary 2.8 in view of $\log(1 - e^{-3x/2}) \ge -e^{-x}$
for $x \ge 1$.
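
Lemmas B.1 and B.2 can be checked exactly in dimension $p = 1$, where the truncated exponential moments have closed forms through the normal distribution function: for $g \sim N(0,1)$ one has $\mathbb{E}\{e^{\lambda g} h(g)\} = e^{\lambda^2/2}\, \mathbb{E}\, h(g + \lambda)$. The following Python sketch compares these exact values with the bounds (B.8) and (B.9). It is only an illustration under the assumption $p = 1$, and all function names are hypothetical.

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function, written via erfc for tail accuracy."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def mgf_outside(lam, r):
    """Exact E[exp(lam * g) 1(|g| > r)] for g ~ N(0, 1)."""
    tail = std_normal_cdf(lam - r) + std_normal_cdf(-r - lam)
    return math.exp(lam ** 2 / 2) * tail


def mgf_inside(lam, r):
    """Exact E[exp(lam * g) 1(|g| <= r)] for g ~ N(0, 1)."""
    body = std_normal_cdf(r - lam) - std_normal_cdf(-r - lam)
    return math.exp(lam ** 2 / 2) * body


def bound_b8(lam, r, mu, p=1):
    """Right-hand side of (B.8) after exponentiation."""
    return math.exp(-(1 - mu) * r ** 2 / 2 + lam ** 2 / (2 * mu)
                    + (p / 2) * math.log(1 / mu))


def bound_b9(lam, x):
    """Right-hand side of (B.9)."""
    return math.exp(lam ** 2 / 2) * (1 - math.exp(-1.5 * x))
```

For instance, with `lam = 1.0`, `x = 1.0` and `r = math.sqrt(6 * (1 + x))`, the value `mgf_inside(lam, r)` exceeds `bound_b9(lam, x)`, and `mgf_outside(lam, r)` stays below `bound_b8(lam, r, 0.5)`, as the lemmas require for $\|\lambda\|^2 \le p$.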

B.6 Proof of Theorem 2.9 

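Due to our previous results, it is convenient to decompose the r.v. $\vartheta$ in the form

$$\vartheta = \vartheta\, \mathbf{1}\{\vartheta \in \Theta_0(r_0)\} + \vartheta\, \mathbf{1}\{\vartheta \notin \Theta_0(r_0)\} = \vartheta^{\circ} + \vartheta^{c}.$$

The large deviation results yield that the posterior distribution of the part $\vartheta^{c}$ is negligible
given a proper choice of $r_0$. Below we show that $\vartheta^{\circ}$ is nearly normal, which
yields the BvM result. Define

$$\vartheta_0 = \mathbb{E}^{\circ}\vartheta, \qquad \mathfrak{S}_0^2 = \operatorname{Cov}(\vartheta^{\circ}) \stackrel{\mathrm{def}}{=} \mathbb{E}^{\circ}\{(\vartheta - \vartheta_0)(\vartheta - \vartheta_0)^{\top}\}.$$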

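It suffices to show that the following holds on $\Omega_n$:

$$\|D_{\epsilon}(\vartheta_0 - \theta_{\epsilon})\|^2 \le C \Delta^{*},$$

$$\|I_p - D_{\epsilon} \mathfrak{S}_0^2 D_{\epsilon}\|_{\infty} \le C \Delta^{*},$$

and, for any $\lambda \in \mathbb{R}^p$ with $\|\lambda\|^2 \le p$

$$\bigl|\log \mathbb{E}^{\circ} \exp\{\lambda^{\top} \mathfrak{S}_0^{-1}(\vartheta - \vartheta_0)\} - \|\lambda\|^2/2\bigr| \le C \Delta^{*}. \tag{B.10}$$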

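Consider $\eta \stackrel{\mathrm{def}}{=} D_{\epsilon}(\vartheta - \theta_{\epsilon})$. Corollaries 2.4 and 2.8 yield for any $\lambda \in \mathbb{R}^p$ with $\|\lambda\|^2 \le p$
that

$$\|\lambda\|^2 \exp(-\Delta^{-}) \le \mathbb{E}^{\circ} |\lambda^{\top}\eta|^2 \le \|\lambda\|^2 \exp(\Delta^{+}), \tag{B.11}$$

$$\|\lambda\|^2/2 - \Delta^{-} \le \log \mathbb{E}^{\circ} \exp(\lambda^{\top}\eta) \le \|\lambda\|^2/2 + \Delta^{+}, \tag{B.12}$$

with $\Delta^{-} = \Delta^{-}_{\epsilon}$ and $\Delta^{+} = \Delta^{+}_{\epsilon}$. Define the first two moments of $\eta$:

$$\bar{\eta} = \mathbb{E}^{\circ}\eta, \qquad S^2 = \mathbb{E}^{\circ}\{(\eta - \bar{\eta})(\eta - \bar{\eta})^{\top}\}.$$

Use the following technical statement.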

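Lemma B.3. Assume (B.11). Then with $\Delta^{*} = \max\{\Delta^{+}, \Delta^{-}\} \le 1/2$

$$\|\bar{\eta}\|^2 \le C \Delta^{*}, \qquad \|S^2 - I_p\|_{\infty} \le C \Delta^{*}. \tag{B.13}$$

Proof. Let $u$ be any unit vector in $\mathbb{R}^p$. We obtain from (B.11)

$$\mathbb{E}^{\circ} |u^{\top}\eta|^2 \le \exp(\Delta^{+}), \qquad \mathbb{E}^{\circ} |u^{\top}\eta|^2 \ge \exp(-\Delta^{-}).$$

Note now that

$$\mathbb{E}^{\circ} |u^{\top}\eta|^2 = u^{\top} S^2 u + |u^{\top}\bar{\eta}|^2.$$

Hence

$$\exp(-\Delta^{-}) \le u^{\top} S^2 u + |u^{\top}\bar{\eta}|^2 \le \exp(\Delta^{+}). \tag{B.14}$$

In a similar way with $u = \bar{\eta}/\|\bar{\eta}\|$ and $\gamma \sim N(0, I_p)$

$$u^{\top} S^2 u = \mathbb{E}^{\circ} |u^{\top}(\eta - \bar{\eta})|^2 \ge \exp(-\Delta^{-})\, \mathbb{E} |u^{\top}(\gamma - \bar{\eta})|^2 = \exp(-\Delta^{-})\, (1 + \|\bar{\eta}\|^2)$$

yielding

$$u^{\top} S^2 u \ge (1 + \|\bar{\eta}\|^2) \exp(-\Delta^{-}).$$

This inequality contradicts (B.14) if $\|\bar{\eta}\|^2 > 2\Delta^{*}$ for $\Delta^{*} < 1$, and (B.13) follows. □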
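The bound for the first moment implies with $\vartheta_0 = \mathbb{E}^{\circ}\vartheta$

$$\|D_{\epsilon}(\vartheta_0 - \theta_{\epsilon})\|^2 \le C \Delta^{*} \tag{B.15}$$

while the second bound yields with $\mathfrak{S}_0^2 = \mathbb{E}^{\circ}\{(\vartheta - \vartheta_0)(\vartheta - \vartheta_0)^{\top}\}$

$$\|D_{\epsilon} \mathfrak{S}_0^2 D_{\epsilon} - I_p\|_{\infty} \le C \Delta^{*}. \tag{B.16}$$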
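Now we consider the moment generating function of $\eta_n \stackrel{\mathrm{def}}{=} \mathfrak{S}_0^{-1}(\vartheta - \vartheta_0)$. We use that

$$\lambda^{\top}\eta_n = \lambda_1^{\top} D_{\epsilon}(\vartheta - \theta_{\epsilon}) + \lambda_1^{\top} D_{\epsilon}(\theta_{\epsilon} - \vartheta_0) = \lambda_1^{\top}\eta + \lambda_1^{\top} D_{\epsilon}(\theta_{\epsilon} - \vartheta_0)$$

with $\lambda_1 = D_{\epsilon}^{-1} \mathfrak{S}_0^{-1} \lambda$. Therefore, it holds

$$\bigl|\log \mathbb{E}^{\circ} \exp\{\lambda^{\top}\eta_n\} - \log \mathbb{E}^{\circ} \exp\{\lambda_1^{\top}\eta\}\bigr| \le \bigl|\lambda^{\top} \mathfrak{S}_0^{-1}(\theta_{\epsilon} - \vartheta_0)\bigr|. \tag{B.17}$$

By (B.16), it holds for $\|\lambda_1\|^2 = \lambda^{\top} \mathfrak{S}_0^{-1} D_{\epsilon}^{-2} \mathfrak{S}_0^{-1} \lambda$

$$(1 - C\Delta^{-})\, \|\lambda\|^2 \le \lambda^{\top} \mathfrak{S}_0^{-1} D_{\epsilon}^{-2} \mathfrak{S}_0^{-1} \lambda \le (1 + C\Delta^{+})\, \|\lambda\|^2 .$$

This, (B.15), (B.17), and (B.12) yield

$$\bigl|\log \mathbb{E}^{\circ} \exp\{\lambda^{\top}\eta_n\} - \|\lambda\|^2/2\bigr| \le C \Delta^{*}.$$

This implies (B.10).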